{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial: Embedding Metadata for Improved Retrieval\n",
    "\n",
    "\n",
    "- **Level**: Intermediate\n",
    "- **Time to complete**: 10 minutes\n",
    "- **Components Used**: [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore), [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever), [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder), [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder)\n",
    "- **Goal**: After completing this tutorial, you'll have learned how to embed metadata information while indexing documents, to improve retrieval.\n",
    "\n",
    "> This tutorial uses Haystack 2.0. To learn more, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).\n",
    "\n",
    "> ⚠️ Note of caution: The method showcased in this tutorial is not always the right approach for all types of metadata. This method works best when the embedded metadata is meaningful. For example, here we're showcasing embedding the \"title\" meta field, which can also provide good context for the embedding model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "While indexing documents into a document store, we have 2 options: embed the text for that document or embed the text alongside some meaningful metadata. In some cases, embedding meaningful metadata alongside the contents of a document may improve retrieval down the line. \n",
    "\n",
    "In this tutorial, we will see how we can embed metadata as well as the text of a document. We will fetch various pages from Wikipedia and index them into an `InMemoryDocumentStore` with metadata information that includes their title, and URL. Next, we will see how retrieval with and without this metadata."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "### Prepare the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)\n",
    "- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/setting-the-log-level)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Haystack\n",
    "\n",
    "Install Haystack 2.0 and other required packages with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: haystack-ai in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (2.0.0b7)\n",
      "Collecting wikipedia\n",
      "  Downloading wikipedia-1.4.0.tar.gz (27 kB)\n",
      "  Preparing metadata (setup.py): started\n",
      "  Preparing metadata (setup.py): finished with status 'done'\n",
      "Requirement already satisfied: boilerpy3 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (1.0.7)\n",
      "Requirement already satisfied: haystack-bm25 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (1.0.2)\n",
      "Requirement already satisfied: jinja2 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (3.1.3)\n",
      "Requirement already satisfied: lazy-imports in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (0.3.1)\n",
      "Requirement already satisfied: more-itertools in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (10.2.0)\n",
      "Requirement already satisfied: networkx in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (3.2.1)\n",
      "Requirement already satisfied: openai>=1.1.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (1.12.0)\n",
      "Requirement already satisfied: pandas in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (2.2.0)\n",
      "Requirement already satisfied: posthog in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (3.4.1)\n",
      "Requirement already satisfied: pyyaml in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (6.0.1)\n",
      "Requirement already satisfied: tenacity in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (8.2.3)\n",
      "Requirement already satisfied: tqdm in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (4.66.2)\n",
      "Requirement already satisfied: typing-extensions in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-ai) (4.9.0)\n",
      "Collecting beautifulsoup4 (from wikipedia)\n",
      "  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)\n",
      "Requirement already satisfied: requests<3.0.0,>=2.0.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from wikipedia) (2.31.0)\n",
      "Requirement already satisfied: anyio<5,>=3.5.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from openai>=1.1.0->haystack-ai) (4.2.0)\n",
      "Requirement already satisfied: distro<2,>=1.7.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from openai>=1.1.0->haystack-ai) (1.9.0)\n",
      "Requirement already satisfied: httpx<1,>=0.23.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from openai>=1.1.0->haystack-ai) (0.26.0)\n",
      "Requirement already satisfied: pydantic<3,>=1.9.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from openai>=1.1.0->haystack-ai) (1.10.9)\n",
      "Requirement already satisfied: sniffio in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from openai>=1.1.0->haystack-ai) (1.3.0)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (3.3.2)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (3.6)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (2.2.0)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (2024.2.2)\n",
      "Collecting soupsieve>1.2 (from beautifulsoup4->wikipedia)\n",
      "  Using cached soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)\n",
      "Requirement already satisfied: numpy in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from haystack-bm25->haystack-ai) (1.26.4)\n",
      "Requirement already satisfied: MarkupSafe>=2.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from jinja2->haystack-ai) (2.1.5)\n",
      "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from pandas->haystack-ai) (2.8.2)\n",
      "Requirement already satisfied: pytz>=2020.1 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from pandas->haystack-ai) (2024.1)\n",
      "Requirement already satisfied: tzdata>=2022.7 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from pandas->haystack-ai) (2024.1)\n",
      "Requirement already satisfied: six>=1.5 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from posthog->haystack-ai) (1.16.0)\n",
      "Requirement already satisfied: monotonic>=1.5 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from posthog->haystack-ai) (1.6)\n",
      "Requirement already satisfied: backoff>=1.10.0 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from posthog->haystack-ai) (2.2.1)\n",
      "Requirement already satisfied: httpcore==1.* in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from httpx<1,>=0.23.0->openai>=1.1.0->haystack-ai) (1.0.2)\n",
      "Requirement already satisfied: h11<0.15,>=0.13 in /Users/tuanacelik/opt/anaconda3/envs/mistral/lib/python3.12/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai>=1.1.0->haystack-ai) (0.14.0)\n",
      "Using cached beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)\n",
      "Using cached soupsieve-2.5-py3-none-any.whl (36 kB)\n",
      "Building wheels for collected packages: wikipedia\n",
      "  Building wheel for wikipedia (setup.py): started\n",
      "  Building wheel for wikipedia (setup.py): finished with status 'done'\n",
      "  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=17926b00d77f1d294e927f20b0e52e7a137fcd6b219ca85e63570f3b5c7d58f3\n",
      "  Stored in directory: /Users/tuanacelik/Library/Caches/pip/wheels/63/47/7c/a9688349aa74d228ce0a9023229c6c0ac52ca2a40fe87679b8\n",
      "Successfully built wikipedia\n",
      "Installing collected packages: soupsieve, beautifulsoup4, wikipedia\n",
      "Successfully installed beautifulsoup4-4.12.3 soupsieve-2.5 wikipedia-1.4.0\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "pip install haystack-ai wikipedia sentence-transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enable Telemetry\n",
    "\n",
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product but you can always opt out by commenting the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(39)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing Documents with Metadata\n",
    "\n",
    "Create a pipeline to store the small example dataset in the [InMemoryDocumentStore](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore) with their embeddings. We will use [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder) to generate embeddings for your Documents and write them to the document store with the [DocumentWriter](https://docs.haystack.deepset.ai/docs/documentwriter).\n",
    "\n",
    "After adding these components to your pipeline, connect them and run the pipeline.\n",
    "\n",
    "> 💡 The `InMemoryDocumentStore` is the simplest document store to run tutorials with and comes with no additional requirements. This can be changed to any of the other available document stores such as **Weaviate, AstraDB, Qdrant, Pinecone and more**. Check out the [full list of document stores](https://haystack.deepset.ai/integrations?type=Document+Store) with instructions on how to run them. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we'll create a helper function that can create indexing pipelines. We will optionally provide this function with `meta_fields_to_embed`. If provided, the `SentenceTransformersDocumentEmbedder` will be initialized with metadata to embed alongside the content of the document.\n",
    "\n",
    "For example, the embedder below will be embedding the \"url\" field as well as the contents of documents:\n",
    "\n",
    "```python\n",
    "from haystack.components.embedders import SentenceTransformersDocumentEmbedder\n",
    "\n",
    "embedder = SentenceTransformersDocumentEmbedder(meta_fields_to_embed=[\"url\"])\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack import Pipeline\n",
    "from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter\n",
    "from haystack.components.embedders import SentenceTransformersDocumentEmbedder\n",
    "from haystack.components.writers import DocumentWriter\n",
    "from haystack.document_stores.types import DuplicatePolicy\n",
    "from haystack.utils import ComponentDevice\n",
    "\n",
    "\n",
    "def create_indexing_pipeline(document_store, metadata_fields_to_embed=None):\n",
    "    document_cleaner = DocumentCleaner()\n",
    "    document_splitter = DocumentSplitter(split_by=\"sentence\", split_length=2)\n",
    "    document_embedder = SentenceTransformersDocumentEmbedder(\n",
    "        model=\"thenlper/gte-large\", meta_fields_to_embed=metadata_fields_to_embed\n",
    "    )\n",
    "    document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)\n",
    "\n",
    "    indexing_pipeline = Pipeline()\n",
    "    indexing_pipeline.add_component(\"cleaner\", document_cleaner)\n",
    "    indexing_pipeline.add_component(\"splitter\", document_splitter)\n",
    "    indexing_pipeline.add_component(\"embedder\", document_embedder)\n",
    "    indexing_pipeline.add_component(\"writer\", document_writer)\n",
    "\n",
    "    indexing_pipeline.connect(\"cleaner\", \"splitter\")\n",
    "    indexing_pipeline.connect(\"splitter\", \"embedder\")\n",
    "    indexing_pipeline.connect(\"embedder\", \"writer\")\n",
    "\n",
    "    return indexing_pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can index our documents from various wikipedia articles. We will create 2 indexing pipelines:\n",
    "\n",
    "- The `indexing_pipeline`: which indexes only the contents of the documents. We will index these documents into `document_store`.\n",
    "- The `indexing_with_metadata_pipeline`: which indexes meta fields alongside the contents of the documents. We will index these documents into `document_store_with_embedded_metadata`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "modules.json: 100%|██████████| 349/349 [00:00<00:00, 561kB/s]\n",
      "config_sentence_transformers.json: 100%|██████████| 124/124 [00:00<00:00, 360kB/s]\n",
      "README.md: 100%|██████████| 93.0k/93.0k [00:00<00:00, 970kB/s]\n",
      "sentence_bert_config.json: 100%|██████████| 52.0/52.0 [00:00<00:00, 156kB/s]\n",
      "config.json: 100%|██████████| 743/743 [00:00<00:00, 2.25MB/s]\n",
      "model.safetensors: 100%|██████████| 133M/133M [02:41<00:00, 826kB/s]  \n",
      "tokenizer_config.json: 100%|██████████| 366/366 [00:00<00:00, 1.92MB/s]\n",
      "vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.09MB/s]\n",
      "tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 1.11MB/s]\n",
      "special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 374kB/s]\n",
      "1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 596kB/s]\n",
      "Batches: 100%|██████████| 2/2 [00:40<00:00, 20.03s/it]\n",
      "Batches: 100%|██████████| 2/2 [00:38<00:00, 19.09s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'writer': {'documents_written': 47}}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import wikipedia\n",
    "from haystack import Document\n",
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "\n",
    "some_bands = \"\"\"The Beatles,The Cure\"\"\".split(\",\")\n",
    "\n",
    "raw_docs = []\n",
    "\n",
    "for title in some_bands:\n",
    "    page = wikipedia.page(title=title, auto_suggest=False)\n",
    "    doc = Document(content=page.content, meta={\"title\": page.title, \"url\": page.url})\n",
    "    raw_docs.append(doc)\n",
    "\n",
    "document_store = InMemoryDocumentStore(embedding_similarity_function=\"cosine\")\n",
    "document_store_with_embedded_metadata = InMemoryDocumentStore(embedding_similarity_function=\"cosine\")\n",
    "\n",
    "indexing_pipeline = create_indexing_pipeline(document_store=document_store)\n",
    "indexing_with_metadata_pipeline = create_indexing_pipeline(\n",
    "    document_store=document_store_with_embedded_metadata, metadata_fields_to_embed=[\"title\"]\n",
    ")\n",
    "\n",
    "indexing_pipeline.run({\"cleaner\": {\"documents\": raw_docs}})\n",
    "indexing_with_metadata_pipeline.run({\"cleaner\": {\"documents\": raw_docs}})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing Retrieval With and Without Embedded Metadata\n",
    "\n",
    "As a final step, we will be creating a retrieval pipeline that will have 2 retrievers:\n",
    "- First: retrieving from the `document_store`, where we have not embedded metadata.\n",
    "- Second: retrieving from the `document_store_with_embedded_metadata`, where we have embedded metadata.\n",
    "\n",
    "We will then be able to compare the results and see if embedding metadata has helped with retrieval in this case.\n",
    "\n",
    "> 💡 Here, we are using the `InMemoryEmbeddingRetriever` because we used the `InMemoryDocumentStore` above. If you're using another document store, change this to use the accompanying embedding retriever for the document store you are using. Check out the [Embedders Documentation](https://docs.haystack.deepset.ai/docs/embedders) for a full list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<haystack.pipeline.Pipeline at 0x19e0933b0>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from haystack.components.embedders import SentenceTransformersTextEmbedder\n",
    "from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever\n",
    "\n",
    "retrieval_pipeline = Pipeline()\n",
    "retrieval_pipeline.add_component(\"text_embedder\", SentenceTransformersTextEmbedder(model=\"thenlper/gte-large\"))\n",
    "retrieval_pipeline.add_component(\n",
    "    \"retriever\", InMemoryEmbeddingRetriever(document_store=document_store, scale_score=False, top_k=3)\n",
    ")\n",
    "retrieval_pipeline.add_component(\n",
    "    \"retriever_with_embeddings\",\n",
    "    InMemoryEmbeddingRetriever(document_store=document_store_with_embedded_metadata, scale_score=False, top_k=3),\n",
    ")\n",
    "\n",
    "retrieval_pipeline.connect(\"text_embedder\", \"retriever\")\n",
    "retrieval_pipeline.connect(\"text_embedder\", \"retriever_with_embeddings\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run the pipeline and compare the results from `retriever` and `retirever_with_embeddings`. Below you'll see 3 documents returned by each retriever, ranked by relevance.\n",
    "\n",
    "Notice that with the question \"Have the Beatles ever been to Bangor?\", the first pipeline is not returning relevant documents, but the second one is. Here, the `meta` field \"title\" is helpful, because as it turns out, the document that contains the information about The Beatles visiting Bangor does not contain a reference to \"The Beatles\". But, by embedding metadata, the embedding model is able to retrieve the right document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = retrieval_pipeline.run({\"text_embedder\": {\"text\": \"Have the Beatles ever been to Bangor?\"}})\n",
    "\n",
    "print(\"Retriever Results:\\n\")\n",
    "for doc in result[\"retriever\"][\"documents\"]:\n",
    "    print(doc)\n",
    "\n",
    "print(\"Retriever with Embeddings Results:\\n\")\n",
    "for doc in result[\"retriever_with_embeddings\"][\"documents\"]:\n",
    "    print(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next\n",
    "\n",
    "🎉 Congratulations! You've embedded metadata while indexing, to improve the results of retrieval!\n",
    "\n",
    "If you liked this tutorial, there's more to learn about Haystack 2.0:\n",
    "- [Creating a Hybrid Retrieval Pipeline](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval)\n",
    "- [Building Fallbacks to Websearch with Conditional Routing](https://haystack.deepset.ai/tutorials/36_building_fallbacks_with_conditional_routing)\n",
    "- [Model-Based Evaluation of RAG Pipelines](https://haystack.deepset.ai/tutorials/35_model_based_evaluation_of_rag_pipelines)\n",
    "\n",
    "To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates) or [join Haystack discord community](https://discord.gg/haystack).\n",
    "\n",
    "Thanks for reading!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mistral",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
